
Conversation

@anmyachev (Contributor) commented Sep 16, 2024

Closes #2255

Partially reusing the changes that were removed in #2142 (namely the part related to using iteration count instead of time) solves the problem of not having enough data for the "CV" column.
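As a rough illustration (not the benchmark code itself; the helper name is made up), the "CV" column is a coefficient of variation computed from the per-iteration timings, and with too few timed iterations the statistic is undefined and shows up as NaN:

```python
import statistics

def coefficient_of_variation(times_ms):
    # Hypothetical helper: CV = sample standard deviation / mean of the timings.
    # With fewer than two samples the sample standard deviation is undefined,
    # which is one way a "*-CV" column ends up as NaN.
    if len(times_ms) < 2:
        return float("nan")
    return statistics.stdev(times_ms) / statistics.mean(times_ms)

print(coefficient_of_variation([1.02, 0.98, 1.01, 0.99, 1.00]))  # ~0.016
print(coefficient_of_variation([1.02]))                          # nan
```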

Old comments

CI status:

UPD: For some reason this greatly affects the mean time. However, if I reduce warmup, the mean does not deteriorate as much.

@anmyachev marked this pull request as ready for review September 30, 2024 13:29
@anmyachev (Contributor, Author) commented Sep 30, 2024

@ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will write the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 finishes:

  • Triton ADV: -5%
  • Triton DFT: -2%
  • Xetla: -8%

Are you aware of this effect where, as the number of runs increases, the average time gets noticeably worse? I don't know what to do with this slowdown, but I still think the idea of running multiple times (>3; only in that case the "*-CV" column will not be NaN) is good from the point of view of calculating the average.

cc @whitneywhtsang @etiotto

@etiotto (Contributor) commented Oct 1, 2024

> @ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will write the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 finishes:
>
> • Triton ADV: -5%
> • Triton DFT: -2%
> • Xetla: -8%
>
> Are you aware of this effect where, as the number of runs increases, the average time gets noticeably worse? I don't know what to do with this slowdown, but I still think the idea of running multiple times (>3; only in that case the "*-CV" column will not be NaN) is good from the point of view of calculating the average.
>
> cc @whitneywhtsang @etiotto

I think that when the warmup runs "too many times" the GPU may start heating up and then throttle the frequency down, so when the timed runs start the performance is reduced. That means we are better off not increasing rep/warmup to the point where we see performance degradation in the benchmarks.

@etiotto (Contributor) left a comment


I do not think we should increase the number of repetitions too much. Going from 10 to 600 repetitions is a huge increase.

The kernel timing distribution should be a normal (Gaussian) curve. We only need to run the benchmark enough times to approximate a Gaussian "bell" curve. From https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. it looks like 30 is the number of reps we should use.
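As an illustration of that rule of thumb (a standalone sketch, not tied to the benchmark harness), the spread of the sample mean shrinks roughly as 1/sqrt(n), so going well beyond ~30 reps buys little extra stability for the mean:

```python
import random
import statistics

random.seed(0)

def spread_of_means(n_reps, trials=2000):
    # Draw n_reps "kernel timings" from a skewed distribution many times and
    # measure how much the resulting sample means vary from trial to trial.
    means = [
        statistics.mean(random.expovariate(1.0) + 1.0 for _ in range(n_reps))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (3, 10, 30, 100):
    print(f"reps={n:3d}  spread of mean ~ {spread_of_means(n):.3f}")
```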

v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype)
sm_scale = 0.125
quantiles = [0.5, 0.0, 1.0]
warmup, rep = 10, 600
Contributor

From 10 to 600 times? That is way too many repetitions. It will increase the time it takes to run the benchmarks too much.

Contributor Author

> From 10 to 600 times? That is way too many repetitions. It will increase the time it takes to run the benchmarks too much.

This value is measured in milliseconds and is needed for some test combinations where one run takes more than 100 ms.
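For example, under the old time-based scheme (n_repeat = max(1, int(rep / estimate_ms)), see the removed lines below), rep = 600 with a kernel that takes ~100 ms per run gives only int(600 / 100) = 6 timed iterations, while a ~1 ms kernel would be timed roughly 600 times.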

@whitneywhtsang (Contributor)

If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?

@anmyachev (Contributor, Author)

> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?

@whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.

@whitneywhtsang (Contributor)

>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>
> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.

I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.

@etiotto (Contributor) commented Oct 7, 2024

>>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>>
>> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.
>
> I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.

I agree. To me "rep" should really mean the number of repetitions of the kernel after warmup. I think we can diverge from upstream here and then try to upstream our changes.

Comment on lines -83 to -85
# compute number of warmup and repeat
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
Contributor Author

There is no point in calculating the number of iterations from the expected time of one iteration, since the required number of iterations is requested by the user directly.
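A minimal sketch of the two interpretations (function names are illustrative, not the actual benchmark code):

```python
def counts_from_time_budget(warmup_ms, rep_ms, estimate_ms):
    # Old behaviour (the removed lines above): derive iteration counts from a
    # time budget in milliseconds and the estimated per-iteration time.
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat

def counts_from_user(warmup, rep):
    # New behaviour: warmup and rep are already iteration counts requested by
    # the caller, so there is nothing to derive from timing estimates.
    return max(1, warmup), max(1, rep)
```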

Comment on lines -193 to -195
# compute number of warmup and repeat
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
Contributor Author

There is no point in calculating the number of iterations from the expected time of one iteration, since the required number of iterations is requested by the user directly.

Comment on lines +152 to +154
# compute warmup and repeat times
warmup_time = n_warmup * estimate_ms
rep_time = n_repeat * estimate_ms
Contributor Author

I translate the parameters into the ones that the upstream function (triton_do_bench) understands.
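Roughly, that translation looks like this (a sketch; upstream_do_bench is a stand-in for triton's time-based do_bench, whose warmup/rep arguments are time budgets in milliseconds):

```python
def do_bench_by_iterations(fn, n_warmup, n_repeat, estimate_ms, upstream_do_bench):
    # Convert user-requested iteration counts into the millisecond budgets
    # that the time-based upstream helper expects (as in the added lines above).
    warmup_time = n_warmup * estimate_ms
    rep_time = n_repeat * estimate_ms
    return upstream_do_bench(fn, warmup=warmup_time, rep=rep_time)
```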

@anmyachev changed the title from "Increase warmup and rep for FA benchmark" to "Use iteration count instead of time for parameters warmup and rep of do_bench* functions for benchmarks" Oct 14, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor, Author)

>>>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>>>
>>> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.
>>
>> I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.
>
> I agree. To me "rep" should really mean the number of repetitions of the kernel after warmup. I think we can diverge from upstream here and then try to upstream our changes.

@etiotto @whitneywhtsang ready for review

@whitneywhtsang (Contributor)

Please update PR description.

@etiotto merged commit d3a8eb0 into main Oct 15, 2024
6 checks passed
@etiotto deleted the amyachev/bench-time branch October 15, 2024 14:35